The Wine Geek Analysis

by Jean Phelippe Ramos de Oliveira

Abstract

Taste is difficult to map and understand because different people have different preferences. At the same time, it’s well known that wines vary in quality, and that this variation is reflected in a wide span of prices. The goal of this document is to present a technical analysis of the physical and chemical variables of Portuguese wines and to shed some light on the relationships between the physicochemical properties of a wine and the ratings given to it by wine experts.

We have deleted the variable listing the IDs.
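As a minimal sketch of that clean-up step, the snippet below drops an ID column from a data frame. The toy data frame and the column name `X` are assumptions made for illustration; the actual name of the ID column is not stated here.

```r
# Toy data frame standing in for the raw file.
# The ID column name "X" is an assumption, not taken from the dataset.
wines <- data.frame(
  X = 1:3,
  alcohol = c(9.4, 9.8, 10.5),
  quality = c(5L, 5L, 6L)
)

# Drop the ID column by name and keep every other column.
wines <- wines[, setdiff(names(wines), "X"), drop = FALSE]
print(names(wines))
```

Dropping by name rather than by position keeps the step valid even if the column order changes.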

Univariate Plots Section

Below are the summary statistics for all variables:

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

In our case, the only categorical variable is quality, an integer score between 0 and 10.

Let’s start by analyzing the quality:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

As we can see, the median rating is 6.0 and the best wine is rated 8.0, which means it is very difficult to find good wines (>7) in this dataset. The plot shows that the majority of wines are rated 5 or 6. This also means that our analysis may be biased towards lower quality wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol percentage by volume ranges from 8.4 to 14.9. The distribution is concentrated around 10%, which is reasonable for most wines I know. I wonder whether alcohol levels are actually relevant for a good rating, as stronger wines may be more difficult to appreciate. Alcohol also evaporates easily and can affect smell.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free SO2 is also spread across a wide range, from a minimum of 1 to a maximum of 72 mg/dm^3, with a higher concentration of values in the range of 7-20 mg/dm^3. This substance is related to the oxidation of the wine, and it’s well known that oxidation affects the taste of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total SO2 has a wide range of values, with a minimum of 6 mg/dm³, a maximum of 289 mg/dm³ and a mean around 46 mg/dm³ (higher than free SO2, as expected, since free SO2 is a part of the total).

After applying a log10 transformation to the plot, the effect of the long tail is reduced and we can clearly see a more normal distribution around the value of 30 mg/dm³.

Fixed and volatile acidity each show two peaks, around 7-8 g/dm³ for fixed and 0.4-0.6 g/dm³ for volatile. Fixed acidity ranges from 4.6 up to almost 16 g/dm³, with a mean around 8 g/dm³. Volatile acidity ranges from 0.12 to 1.58 g/dm³, with a mean around 0.53 g/dm³. Higher levels of volatile acidity (usually linked with acetic acid) can yield a bad, vinegar-like taste. The amount of non-volatile acid is much higher because these acids make up the bulk of the acidity in a wine.

Citric acid levels are, on average, lower than volatile acidity (mean 0.27 vs. 0.53 g/dm³) and small compared with fixed acidity. They range from 0 to 1 g/dm³.

pH levels range from 2.74 to 4.01, which is within the acidic range, exactly as we expect from wines.

Chlorides are usually linked to a salty taste and should be found in small quantities. They range from 0.012 up to 0.611 g/dm³, with a median around 0.08 g/dm³, so the maximum in this dataset is about 50 times the minimum.

Higher quality wines seem to have a more concentrated range of chlorides (<0.1 g/dm³), but we still need to dive deeper into the data to check whether there’s any relationship between the variables.

Univariate Analysis

What is the structure of your dataset?

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Variables:

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)
  12. quality (score between 0 and 10)

According to the list above, we have 12 variables: the first 11 are measured physicochemical properties, and the final variable, quality, is a rating from 0 (very bad) to 10 (excellent) given by at least 3 wine experts. There are 1599 observations in this dataset. Most of the wines are of average quality (ratings of 5-6).

What is/are the main feature(s) of interest in your dataset?

The main feature, familiar to us and to many people who appreciate drinks, is the alcohol level. We’d like to evaluate the correlation between alcohol level and quality, and use the other variables to understand what drives quality, as we believe there’s a limit to how far the alcohol level can affect taste positively.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It’s a geeky analysis, as many other variables can affect the taste of wine, such as temperature and how long the wine was in contact with oxygen. However, for this analysis we want to focus on physicochemical properties. We think that the properties commonly considered in a wine, such as alcohol percentage, acidity levels (including pH), total sulfur dioxide (related to oxidation) and residual sugar (related to sweetness), may be the most impactful ones.

Did you create any new variables from existing variables in the dataset?

Based on the description of the dataset, wines with free SO2 levels above 50 ppm (or mg/dm³) may have their taste affected. SO2 mainly helps prevent oxidation, but can affect the taste in high concentrations. Thus, we’ve created a categorical variable called level.free.sulfur.dioxide that splits the dataset into High (>50 ppm) and Low (<=50 ppm) levels of free SO2.
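The split can be sketched as follows. This is a minimal illustration rather than the exact code used for the analysis: the vector `free_so2` holds hypothetical readings chosen to sit on both sides of the 50 ppm threshold, and `level_free_so2` is an illustrative name.

```r
# Hypothetical free SO2 readings in mg/dm^3 (illustrative values only).
free_so2 <- c(1, 14, 50, 51, 72)

# Label readings above 50 as "High" and everything else (<= 50) as "Low".
level_free_so2 <- factor(
  ifelse(free_so2 > 50, "High", "Low"),
  levels = c("Low", "High")
)

# Count how many readings fall in each level.
print(table(level_free_so2))
```

Setting the factor levels explicitly keeps "Low" before "High" in tables and plot legends instead of relying on alphabetical order.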

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Fixed and volatile acidity showed 2 peaks each, whereas total SO2 showed a long-tailed shape. We applied a log10 transform to total SO2 in order to reduce the skewness from the long tail and visualize it in a more normal shape.
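To illustrate why a log10 transform helps with a long right tail, the sketch below compares sample skewness before and after the transform. The data are synthetic (log-normal by construction, centred near the reported median of 38), so the numbers only demonstrate the effect and are not results from the wine dataset. The `skewness` helper is defined here and is not a base R function.

```r
set.seed(42)

# Synthetic right-skewed values standing in for total SO2.
# Log-normal by construction; this is an assumption for illustration.
total_so2 <- rlnorm(1000, meanlog = log(38), sdlog = 0.7)

# Sample skewness: mean of the cubed standardized values.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

raw_skew <- skewness(total_so2)
log_skew <- skewness(log10(total_so2))

cat("skewness before:", round(raw_skew, 2),
    "after log10:", round(log_skew, 2), "\n")
```

A skewness close to zero after the transform is what "a more normal shape" means in practice for the histogram.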

Bivariate Plots Section


##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
pairs.panels(data, scale=TRUE)

After checking the correlations between all variables in the dataset, we could see some high correlations between certain variables:

There is a high positive correlation between fixed acidity and density (about 0.67), although the points are a bit scattered in the central part.

There is also a positive correlation between citric acid and fixed acidity (about 0.67).

There is a clear negative correlation between fixed acidity and pH (about -0.68). Higher fixed acidity yields lower pH, which follows from the definition of pH.

Total and free sulfur dioxide also present a positive correlation (about 0.67), as free sulfur dioxide is a part of the total sulfur dioxide.

As we can see from above, the 50 ppm threshold separating High from Low free SO2 roughly coincides with the 99th percentile of free sulfur dioxide. In other words, the cases with High free SO2 are outliers and don’t represent the dataset. Pretty much all wines are therefore under the 50 ppm that the description of the dataset defines as an acceptable amount of free SO2. From now on we won’t use the variable we created, as it doesn’t bring much value.

Bivariate Analysis


Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • Higher alcohol levels are correlated with higher quality (0.48).

  • Volatile acidity is negatively correlated with quality (-0.39). From the plots above we can clearly see that higher quality wines have lower volatile acidity, within a narrower range.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Fixed acidity and density seem to have a strong positive correlation.

  • Fixed acidity and citric acid levels also seem to have a strong positive correlation. This is expected, as fixed acidity is related to the acids in the wine.

  • Fixed acidity and pH have a strong correlation as well, but a negative one, which is expected as higher acidity is linked to lower pH by definition.

  • Total SO2 and free SO2 also seem to have a positive correlation, which is expected as more SO2 will eventually lead to more free SO2 in chemical mixtures. The degree will depend on the oxidation levels of the mixture, but for wines the two seem to be clearly correlated.

  • Interestingly enough, density and alcohol are also negatively correlated (-0.5). Higher levels of alcohol yield lower density wines. The points are very scattered, so the correlation is not as strong as the previous ones, but it is still clear.

What was the strongest relationship you found?

The strongest relationship found is between fixed acidity and pH (-0.68), which is expected as pH is a function of the acidity of a substance. There is also a strong correlation between total and free sulfur dioxide (0.67), which makes sense as free sulfur dioxide is a part of the total.
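One way to confirm the strongest pair programmatically is sketched below. The values are copied from the correlation matrix above and rounded to three decimals for a subset of five variables; the subset and the names `r`, `r_upper` and `strongest` are illustrative choices, not part of the original analysis.

```r
# Correlations copied from the matrix above (rounded) for a subset of variables.
vars <- c("fixed.acidity", "citric.acid", "density", "pH", "alcohol")
r <- matrix(
  c( 1.000,  0.672,  0.668, -0.683, -0.062,
     0.672,  1.000,  0.365, -0.542,  0.110,
     0.668,  0.365,  1.000, -0.342, -0.496,
    -0.683, -0.542, -0.342,  1.000,  0.206,
    -0.062,  0.110, -0.496,  0.206,  1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Blank out the diagonal and the duplicated lower triangle,
# then locate the largest absolute correlation that remains.
r_upper <- r
r_upper[!upper.tri(r_upper)] <- NA
idx <- which(abs(r_upper) == max(abs(r_upper), na.rm = TRUE), arr.ind = TRUE)

strongest <- c(rownames(r)[idx[1, "row"]], colnames(r)[idx[1, "col"]])
print(strongest)
```

With the full data frame available, the same steps would apply to the output of `cor()` on all twelve columns.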

Multivariate Plots Section


Higher quality wines tend to stay in the lower right part of the plot, which means higher fixed acidity and lower volatile acidity.

Higher levels of alcohol and fixed acidity correlate with higher quality wines: the wines with quality >=7 stay in the top half of the plot.

Good wines concentrate on the left side of the plot. We can see some correlation between higher quality wines and lower levels of total sulfur dioxide. For free sulfur dioxide it’s not as clear, but the higher quality wines tend to stay in a region of lower concentration.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

If we look at volatile acidity vs fixed acidity vs quality, we can see a “sweet spot” in the bottom right corner where the top quality wines (quality = 8) lie, surrounded by wines of quality 5, 6 and 7. It’s not 100% correlated, but we can see a higher density of 7’s and 8’s there compared to the opposite corner, with higher volatile acidity and lower fixed acidity.

Also, if we look at alcohol vs fixed acidity vs quality, we can see another area with a higher concentration of top quality wines (especially if we focus on 7’s [yellow dots] and 8’s [black dots]) in the top right corner, with higher alcohol percentage and higher fixed acidity. This aligns with our analysis from previous sections. It seems that alcohol levels and fixed acidity contribute to the perception of a higher quality wine. If we do the inverse exercise and look at the bottom part of the chart, we can see almost no wines above 6 in terms of quality.

Finally, when looking at free sulfur dioxide vs total sulfur dioxide vs quality, we can see that the majority of wines concentrate in a range of <25 mg/dm³ for free sulfur dioxide, but without much correlation with quality, which makes sense as the property is just related to conserving the wine. On the other hand, we can see higher quality wines staying at lower levels of total sulfur dioxide. It’s known that excessively high levels of sulfur dioxide may alter the taste of wine.

Were there any interesting or surprising interactions between features?

It seems there’s a cluster towards lower levels of free sulfur dioxide and total sulfur dioxide. This makes sense as sulfur dioxide is used mainly for conservation purposes. Therefore, it should have a standard concentration amongst wines, within a delimited range. More than a certain concentration would only ruin the wine, while less would make it too perishable.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary


Plot One

Description One

Among the properties analyzed, alcohol and density show a clear negative correlation (approximately -0.50 in the correlation matrix; the strongest correlations overall, around 0.67 to 0.68 in absolute value, involve fixed acidity). Plot One demonstrates the correlation and includes a linear model, drawn with the geom_smooth function from the ggplot2 library, that shows the trend and its error as a representation of the correlation.
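The kind of straight-line fit that geom_smooth draws when a linear method is requested can be sketched in base R with `lm()`. The data below are synthetic: the alcohol range matches the summary statistics, but the density values, the built-in slope and the noise level are assumptions chosen to mimic a correlation near -0.5, not measurements from the dataset.

```r
set.seed(1)

# Synthetic alcohol (% by volume) values across the range seen in the summary.
alcohol <- runif(200, min = 8.4, max = 14.9)

# Synthetic density (g/cm^3) with an assumed negative slope plus noise.
density <- 1.005 - 0.0008 * alcohol + rnorm(200, sd = 0.0026)

# Ordinary least squares line: density as a linear function of alcohol.
fit <- lm(density ~ alcohol)
slope <- unname(coef(fit)["alcohol"])
r <- cor(alcohol, density)

cat("slope:", signif(slope, 3), " r:", round(r, 2), "\n")
```

A negative slope together with a negative `r` is the pattern a downward-sloping trend line represents: wines with more alcohol tend to be less dense.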

Plot Two

Description Two

Plot Two shows us the correlation between higher quality wines, lower levels of volatile acidity and higher levels of fixed acidity. In other words, the higher quality wines concentrate at the bottom right corner of the plot.

Plot Three

Description Three

The variable with the strongest correlation to quality was alcohol (0.48). From Plot Three we can see that higher quality wines (quality levels around 7 and 8) have a higher alcohol percentage by volume. It seems that this variable has strong relevance in defining wine quality.
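A per-rating summary is one simple way to check this pattern numerically. The sketch below uses a small toy data frame whose values are illustrative only and are not taken from the dataset; `toy` and `median_by_quality` are hypothetical names.

```r
# Toy ratings and alcohol values (illustrative only, not from the dataset).
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 8),
  alcohol = c(9.4, 9.5, 9.8, 10.0, 10.5, 10.9, 11.2, 11.8, 12.6)
)

# Median alcohol percentage for each quality rating.
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
print(median_by_quality)
```

Medians are used here instead of means so that a few unusually strong or weak wines in a rating group do not dominate the comparison.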

Reflection


We have performed an Exploratory Data Analysis of the dataset related to the red variant of the Portuguese “Vinho Verde” wine. We first tried to understand the dataset and its variables and to validate them against expectations, for instance the low pH levels expected from wines due to their acidity. Similarly, we looked into trends and how the quality of the wines was distributed. Afterwards, we started to look for correlations and trends between different variables. Finally, we looked at those same trends and their correlation with quality, the main variable of interest in this dataset.

As a next step, we would like to look into prices, costs and brands and how they would affect the quality evaluations. We would also like to better understand how weather and temperature influence the evaluation process, as they definitely affect the sensory experience.

Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.